{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "f113cd95-ad5e-46d5-87fe-5e55a486f635",
   "metadata": {},
   "source": [
    "# Data preprocessing\n",
    "\n",
    "In this tutorial, we will guide you through the preparation of a dataset for training.\n",
    "\n",
    "Same as in the [Quick start](./0.quick_start.ipynb), we will use the [Large Movie Review Dataset](https://huggingface.co/datasets/stanfordnlp/imdb) dataset.\n",
    "\n",
    "To preprocess the dataset, there are two available approaches:\n",
    "* Native MindSpore `Dataset` API\n",
    "* Modify the `BaseMapFunction` API in MindNLP\n",
    "\n",
    "While native MindSpore approach gives you more flexibility, the `BaseMapFunction` approach helps to wrap the code for better readability.\n",
    "\n",
    "In addition, working in Ascend or GPU environment brings subtle difference in the processing procedure, mainly due to different handling of dynamic shapes. We will see it as we proceed."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "443c60e2-5bc6-4221-8ede-9e295ec6e6d7",
   "metadata": {},
   "source": [
    "## Load and split the dataset\n",
    "First, load the dataset from Hugging Face repository:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "56f451ff-416c-4391-b8f9-00139ccb4910",
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "from mindnlp import load_dataset\n",
    "\n",
    "imdb_ds = load_dataset('imdb', split=['train', 'test'])\n",
    "imdb_train = imdb_ds['train']\n",
    "imdb_test = imdb_ds['test']"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "90437b35-05f2-40ec-b687-30a95e0e8943",
   "metadata": {},
   "source": [
    "`load_dataset` accepts a dataset name for fetching remotely from the Hugging Face repository, as well as a local path pointing to a dataset stored on disk.\n",
    "\n",
    "The `split` parameter informs `load_dataset` to fetch which split of the dataset. Here it will fetch both the training dataset ('train') and the test dataset ('test')."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "37eec05e-67a5-4338-813f-b421908466ac",
   "metadata": {},
   "source": [
    "To further split training dataset into training and validation datasets, use the `.split()` method. The list of numbers specify the proportion of data entries going to each splits."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "9f5e44ed-ffd5-41dc-9ea1-861551150341",
   "metadata": {},
   "outputs": [],
   "source": [
    "imdb_train, imdb_val = imdb_train.split([0.7, 0.3])"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b3cb44ee-1c2b-451d-886f-a88078d45174",
   "metadata": {},
   "source": [
    "To have a peek into how the dataset looks like, get the first element from the iterator of the dataset:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "358e587a-b921-4c24-9e91-f0bc54bc545c",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "{'text': Tensor(shape=[], dtype=String, value= 'I chuckled a few times during this movie. I laughed out loud during the notarizing of the margarine company handover (pun intended).<br /><br />There are three segments in this movie. The first one is supposed to be a spoof of \"woman \\'grows up\\' and launches career\" movies. The Tampax® box was the funniest thing in this segment. Most of the cast members aren\\'t listed here on IMDb. They are the lucky ones. Few other people will be able to connect this thing to the ruin of their acting careers.<br /><br />The second segment is a spoof of \"sharkish woman sleeps her way to the top and seizes control of huge industry\" movies. Robert Culp has several funny moments, all physical humor, including the aforementioned handover. After his character dies the segment sinks lower and lower as Dominique Corsaire rises higher and higher. By the time she becomes First Lady I wanted to rip the cable out of the TV and watch \"snow.\" I switched to Pakistani music videos instead. I don\\'t understand Urdu, or whatever language the videos were in. It was still better than listening to the dialogue in this painfully dull \"story.\"<br /><br />Then came \"Municipalians\" with the *big* stars, half of them on screen for less than a minute: Elisha Cook, Jr., Christopher Lloyd, Rhea Perlman, Henny Youngman, Julie Kavner, Richard Widmark and ... *Robby Benson.* It\\'s supposed to be a spoof of \"young cop teams with hardened, substance abusing older cop who needs retirement *badly*\" movies. The horizontal flash bar on the police car is very impressive. It was interesting seeing old RTD buses, and a Shell gas station sign, and an American Savings sign -- none of them are around anymore. Nagurski\\'s \"Never stop anywhere you might have to get out the car\" made me smile momentarily. Then they discuss how boring the young cop is. A lot. Back and forth about how boring he is. That was as boring as this description of how boring it is. Nagurski\\'s Law Number Four, \"Never go into a music store that\\'s been cut into with an acetylene torch,\" made me think that the music store is a real business at the actual location the dispatcher gave. Thinking about that was more interesting than the set-up for the gag which followed. Young Falcone (Benson) gets shot. A lot. He becomes a hardened cop like Nagurski. The segment keeps going. On and on. And on. It won\\'t stop. It rolls relentlessly onward no matter how many times you wish he\\'d just *die* already so this thing will end. It doesn\\'t. It goes on and on and on.... Then a \"Buffy the Vampire Slayer\" episode which I\\'ve seen four times already comes on. Thank God! This abysmal movie ended while I went to get the mail.'), 'label': Tensor(shape=[], dtype=Int64, value= 0)}\n"
     ]
    }
   ],
   "source": [
    "print(next(imdb_train.create_dict_iterator()))"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a76b778f-17b8-4036-862f-deff2116b3c2",
   "metadata": {},
   "source": [
    "## Load the tokenizer\n",
    "A tokenizer converts raw text into a format that the corresponding model can process, which is crucial for natural language processing tasks.\n",
    "\n",
    "We make use of the `AutoTokenizer` from MindNLP to fetch and instantiate the appropriate tokenizer for a pre-trained model:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "id": "124981c4-99e0-44d7-a7c1-8b6120d0753a",
   "metadata": {},
   "outputs": [],
   "source": [
    "from mindnlp.transformers import AutoTokenizer\n",
    "tokenizer = AutoTokenizer.from_pretrained('bert-base-cased')"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "28f7518e-37be-4b39-a9e8-634fbe731e48",
   "metadata": {},
   "source": [
    "To get the corresponding tokenizer, you can supply the model name, in this case `'bert-base-cased'`, to the `AutoTokenizer.from_pretrained` method. It will then download the tokenizer required by your model."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "94359edc-eb93-4937-a16e-7435ebd8e0eb",
   "metadata": {},
   "source": [
    "## Preprocess with native MindSpore\n",
    "To preprocess the dataset with native MindSpore, we write a function `process_dataset` that comprises the crucial steps:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "id": "9b6f6857-866c-472c-889f-ab64eb68a228",
   "metadata": {},
   "outputs": [],
   "source": [
    "import mindspore\n",
    "import numpy as np\n",
    "from mindspore.dataset import GeneratorDataset, transforms\n",
    "\n",
    "def process_dataset(dataset: GeneratorDataset, tokenizer, max_seq_len=256, batch_size=32, shuffle=False, take_len=None):\n",
    "    is_ascend = mindspore.get_context('device_target') == 'Ascend'\n",
    "    # The tokenize function\n",
    "    def tokenize(text):\n",
    "        if is_ascend:\n",
    "            tokenized = tokenizer(text, padding='max_length', truncation=True, max_length=max_seq_len)\n",
    "        else:\n",
    "            tokenized = tokenizer(text, truncation=True, max_length=max_seq_len)\n",
    "        return tokenized['input_ids'], tokenized['token_type_ids'], tokenized['attention_mask']\n",
    "\n",
    "    # Shuffle the order of the dataset\n",
    "    if shuffle:\n",
    "        dataset = dataset.shuffle(buffer_size=batch_size)\n",
    "\n",
    "        # Select the first several entries of the dataset\n",
    "    if take_len:\n",
    "        dataset = dataset.take(take_len)\n",
    "\n",
    "    # Apply the tokenize function, transforming the 'text' column into the three output columns generated by the tokenizer.\n",
    "    dataset = dataset.map(operations=[tokenize], input_columns=\"text\", output_columns=['input_ids', 'token_type_ids', 'attention_mask'])\n",
    "    # Cast the datatype of the 'label' column to int32 and rename the column to 'labels'\n",
    "    dataset = dataset.map(operations=transforms.TypeCast(mindspore.int32), input_columns=\"label\", output_columns=\"labels\")\n",
    "    # Batch the dataset with padding.\n",
    "    if is_ascend:\n",
    "        dataset = dataset.batch(batch_size)\n",
    "    else:\n",
    "        dataset = dataset.padded_batch(batch_size, pad_info={'input_ids': (None, tokenizer.pad_token_id),\n",
    "                                                             'token_type_ids': (None, 0),\n",
    "                                                             'attention_mask': (None, 0)})\n",
    "    return dataset"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b752889d-7d6c-452e-a659-f34a61e63669",
   "metadata": {},
   "source": [
    "Here is a breakdown of each step:\n",
    "* #### Tokenization\n",
    "The first step is tokenization. Tokenization converts the raw text into a format that can be fed into a machine learning model.\n",
    "\n",
    "Define the tokenize function to process the text in each row of the dataset.\n",
    "```python\n",
    "def tokenize(text):\n",
    "    tokenized = tokenizer(text, truncation=True, max_length=max_seq_len)\n",
    "    return tokenized['input_ids'], tokenized['token_type_ids'], tokenized['attention_mask']\n",
    "```\n",
    "\n",
    "Then make use of the `GeneratorDataset.map` API from MindSpore to map the tokenize operation onto all rows in the dataset. It will take the `\"text\"` column as input, tokenize it and return `\"input_ids\"`, `\"token_type_ids\"` and `\"attention_mask\"` columns as output.\n",
    "```python\n",
    "dataset = dataset.map(operations=[tokenize], input_columns=\"text\", output_columns=['input_ids', 'token_type_ids', 'attention_mask'])\n",
    "```\n",
    "\n",
    "* #### Type casting\n",
    "In some cases, the datatype of a column needs to be casted to a different one. Here, the `\"label\"` column in our dataset is originally of type `Int64`. We create an operation using `mindspore.dataset.transform` to cast the datatype into `Int32`. Then map this operation onto each element in the `\"label\"` column.\n",
    "\n",
    "Notice that the `output_columns` is with name `\"labels\"`, instead of `\"label\"`, and hence we renamed the column to a new name after mapping.\n",
    "```python\n",
    "from mindspore.dataset import transforms\n",
    "dataset = dataset.map(operations=transforms.TypeCast(mindspore.int32), input_columns=\"label\", output_columns=\"labels\")\n",
    "```\n",
    "\n",
    "* #### Shuffling\n",
    "Shuffling the order of dataset entries is important to ensure that the model does not learn the order of the data, which could lead to overfitting. Shuffle the dataset with the `shuffle` method:\n",
    "```python\n",
    "dataset = dataset.shuffle(buffer_size=batch_size)\n",
    "```\n",
    "Note that normally shuffling should precede the batching step, ensuring that the entry order is randomized within each batch as well.\n",
    "\n",
    "* #### Batching with Padding\n",
    "To facilitate batch processing in the model, we will group every `batch_size` number of rows into one batch. A special requirement in batching natural language dataset is to ensures that all sequences in a batch are of the same length. This is achieved by padding, which is included in the `padded_batch` method.\n",
    "```python\n",
    "dataset = dataset.padded_batch(batch_size, pad_info={'input_ids': (None, tokenizer.pad_token_id),\n",
    "                                                     'token_type_ids': (None, 0),\n",
    "                                                     'attention_mask': (None, 0)})\n",
    "```\n",
    "So far, `padded_batch` only works on GPU platforms that supports dynamic shape of tensors. If you are working with Ascend, you need to use the `batch` method:\n",
    "```python\n",
    "is_ascend = mindspore.get_context('device_target') == 'Ascend' # Check whether the platform is Ascend\n",
    "if is_ascend:\n",
    "    dataset = dataset.batch(batch_size)\n",
    "```\n",
    "\n",
    "* #### Taking a Subset\n",
    "\n",
    "Sometimes, you might want to train or test on a smaller subset of the data, for example to debug training process. For this purpose, use the `take` method, which select the specified number (`take_len`) of entries from the dataset:\n",
    "```python\n",
    "dataset = dataset.take(take_len)\n",
    "```"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "8a8faea9-a184-405f-ae06-503c0a694c58",
   "metadata": {},
   "source": [
    "Now apply the preprocessing function to the dataset:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "id": "1dfce37b-790a-49c5-8024-c20601bf1c3d",
   "metadata": {},
   "outputs": [],
   "source": [
    "batch_size = 4 # Size of each batch\n",
    "processed_dataset_train = process_dataset(imdb_train, tokenizer, batch_size=batch_size, shuffle=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "d9d4846f-0281-438c-8cf5-bf752f3aec25",
   "metadata": {},
   "source": [
    "Check the processed dataset:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "id": "430fa5bc-4aa1-4359-8362-3a42855d24fd",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "{'input_ids': Tensor(shape=[4, 245], dtype=Int64, value=\n",
      "[[ 101, 1188, 1110 ...    0,    0,    0],\n",
      " [ 101, 1188, 2523 ...    0,    0,    0],\n",
      " [ 101, 1188, 1110 ...    0,    0,    0],\n",
      " [ 101, 1188, 2523 ...  107,  119,  102]]), 'token_type_ids': Tensor(shape=[4, 245], dtype=Int64, value=\n",
      "[[0, 0, 0 ... 0, 0, 0],\n",
      " [0, 0, 0 ... 0, 0, 0],\n",
      " [0, 0, 0 ... 0, 0, 0],\n",
      " [0, 0, 0 ... 0, 0, 0]]), 'attention_mask': Tensor(shape=[4, 245], dtype=Int64, value=\n",
      "[[1, 1, 1 ... 0, 0, 0],\n",
      " [1, 1, 1 ... 0, 0, 0],\n",
      " [1, 1, 1 ... 0, 0, 0],\n",
      " [1, 1, 1 ... 1, 1, 1]]), 'labels': Tensor(shape=[4], dtype=Int32, value= [1, 1, 1, 1])}\n"
     ]
    }
   ],
   "source": [
    "print(next(processed_dataset_train.create_dict_iterator()))"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "3f7430a7-64f4-4fac-8cb1-737578cd2d8f",
   "metadata": {},
   "source": [
    "## Preprocess with `BaseMapFunction`\n",
    "An alternative way to preprocess the dataset for training is through the `BaseMapFunction` from MindNLP. You can modify the `BaseMapFunction` to create your mapping function:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "id": "83744acb-450f-4a80-b733-47b948b615c6",
   "metadata": {},
   "outputs": [],
   "source": [
    "import mindspore as ms\n",
    "from mindnlp.dataset import BaseMapFunction\n",
    "\n",
    "class ModifiedMapFunction(BaseMapFunction):\n",
    "    def __call__(self, text, label):\n",
    "        tokenized = tokenizer(text, max_length=512, padding='max_length', truncation=True)\n",
    "        labels = label.astype(ms.int32)\n",
    "        return tokenized['input_ids'], tokenized['token_type_ids'], tokenize['attention_mask'], labels\n",
    "\n",
    "map_fn = ModifiedMapFunction(['text', 'label'], ['input_ids', 'token_type_ids', 'attention_mask', 'labels'])"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "f963bdaa-66ff-486c-adca-6eac666d650b",
   "metadata": {},
   "source": [
    "The modified map function will take the text and label from each entry, tokenize the text, cast the label into type `Int32` and output the input_ids, token_type_ids, attention_mask and labels.\n",
    "\n",
    "Note that the names of input and output columns are defined only when the map function is instantiated.\n",
    "\n",
    "You may notice that the map function does not involve the batching operation. This is because the `Trainer` class offers internal batching functionality, which can be enabled by setting the `per_device_train_batch_size` parameter in the `TrainingArgument` object.\n",
    "\n",
    "Let's now pass the `map_fn` into the `Trainer` together with other arguments:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "id": "7ac47a93-789d-4984-b872-0233caa84bb6",
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "The following parameters in checkpoint files are not loaded:\n",
      "['cls.predictions.bias', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.dense.weight', 'cls.seq_relationship.bias', 'cls.seq_relationship.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.LayerNorm.weight']\n",
      "The following parameters in models are missing parameter:\n",
      "['classifier.weight', 'classifier.bias']\n"
     ]
    }
   ],
   "source": [
    "from mindnlp.engine import Trainer, TrainingArguments\n",
    "from mindnlp.transformers import AutoModelForSequenceClassification\n",
    "\n",
    "model = AutoModelForSequenceClassification.from_pretrained('bert-base-cased', num_labels=2)\n",
    "training_args = TrainingArguments(\n",
    "    output_dir='../../output',\n",
    "    per_device_train_batch_size=16\n",
    ")\n",
    "\n",
    "trainer = Trainer(\n",
    "    model=model,\n",
    "    args=training_args,\n",
    "    train_dataset=imdb_train,\n",
    "    map_fn=map_fn,\n",
    ")"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.18"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
